Abstract

The comparative genomics of prokaryotes has shown the presence of 
conserved regions containing highly similar genes (the 'core genome') and
other regions that vary in gene content (the 'flexible' regions). A significant part 
of the latter is involved in surface structures that are phage recognition targets. 
Another sizeable part provides for differences in niche exploitation. 
Metagenomic data indicate that natural populations of prokaryotes are
composed of assemblages of clonal lineages or "meta-clones" that share a 
core of genes but contain a high diversity by varying the flexible component. 
This meta-clonal diversity is maintained by a collection of phages that equalize 
the populations by preventing any individual clonal lineage from hoarding 
common resources. Thus, this polyclonal assemblage and the phages preying 
upon them constitute natural selection units. 
The pan-genomic world

Bacterial and archaeal genomes show a surprising diversity in gene 
content even in otherwise very similar strains 1,2 . Some parts of the ge- 
nome are shared and keep a high sequence similarity. The 95% average 
nucleotide identity (ANI) appears as a kind of "magic number" that 
fits the definition of most classical species, replacing the pre-genomic 
70% DNA-DNA hybridization "golden rule" 3 . This is the 'core' of 
prokaryotic genomes (surprisingly similar figures hold for Bacteria 
and Archaea in spite of their highly divergent molecular biology) 4 5 .

However, the most remarkable finding from prokaryotic genomes
is the presence of other genomic regions that are extremely variable
and differ in gene content and synteny from one strain to another 6 .
One consequence is that the diversity of genes found in a
single prokaryotic species is stunning; for example, in Escherichia
coli, with about 5000 genes per genome, it is estimated that there
could be about 45,000 different gene families in its pan-genome 7 .
The 'open-ness' of a bacterial pan-genome for a species ranges 
from roughly twice the size of an individual genome, to more than 
ten-fold 6 8 . Thus, the genetic diversity hidden in the prokaryotic do- 
main is much higher than initially suspected. This raises several 
questions regarding the biology and evolution of prokaryotic cells. 
How is this enormous diversity generated and, even more impor- 
tantly, maintained? How does it impact an organism's survival strat- 
egies and adaptation potential? These questions are fundamental 
gaps in our knowledge of the largest and oldest group of organisms 
on the planet. 
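
The distinction between core and pan-genome can be made concrete with plain set operations. The following is a minimal sketch, not the method used in the cited studies: the strain names and gene-family labels are hypothetical toy data, and real analyses first have to cluster genes into families.

```python
from typing import Dict, Set, Tuple


def core_and_pan(strains: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (core, pan) gene-family sets for a mapping strain -> gene families."""
    gene_sets = list(strains.values())
    if not gene_sets:
        return set(), set()
    core = set.intersection(*gene_sets)  # families present in every strain
    pan = set.union(*gene_sets)          # families present in at least one strain
    return core, pan


def openness(strains: Dict[str, Set[str]]) -> float:
    """Pan-genome size divided by the mean number of families per genome."""
    _, pan = core_and_pan(strains)
    mean_genome = sum(len(g) for g in strains.values()) / len(strains)
    return len(pan) / mean_genome


# Hypothetical toy data: three strains sharing a core but differing in
# surface-related and transporter gene families.
toy = {
    "strain_A": {"rpoB", "gyrA", "recA", "wzx_1", "tonB_1"},
    "strain_B": {"rpoB", "gyrA", "recA", "wzx_2", "tonB_2"},
    "strain_C": {"rpoB", "gyrA", "recA", "wzx_3", "tonB_1"},
}
core, pan = core_and_pan(toy)
flexible = pan - core  # families absent from at least one strain
```

With the figures quoted above for E. coli, the same ratio would be about 45,000 / 5000 = 9, within the two-fold to more than ten-fold range mentioned in the text.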

The availability of multiple genomes from the same bacterial species
has greatly advanced our knowledge of prokaryotic pan-genomes 28 .
Furthermore, the availability of large metagenomic datasets permits 
the analysis of the presence or absence of parts of the genomes of 
microbes that are known to be abundant in a specific habitat 912 ; 
this bypasses the limitations and biases of pure culture retrieval of 
strains. We can begin to see general trends now in the pools of genes 
in the core and 'flexible' components of prokaryotic pan-genomes. 

The flexible pool and the cell surface 

One major problem of the flexible pool is its remarkable diversity, 
which makes it hard to classify its genes. Being less widely dis- 
tributed, they are more difficult to annotate and often appear as 
hypothetical proteins. However, as more genomes are sequenced, 
patterns start to be discernible. Particularly informative are the clus- 
ters of 'flexible genes' collected in genomic islands (GIs), which 
contain groups of contiguous genes, making functional inference 
much more reliable. Much of the flexible pool is collected in GIs of
10 kb or more. In this paper we will focus on some of these islands
that are present in most (or all) strain genomes but contain different
genes (i.e. they are in the same genomic context and code
for the same type of function or structure but the genes are only
distantly related, if at all). For the sake of clarity we will designate
these genomic islands, found in many strains but containing different
genes, 'flexible Genomic Islands' (fGIs).

One kind of fGI that appears to have universal distribution encodes
the synthesis of exposed structures of the prokaryotic cell. One
of the most remarkable examples of this phenomenon is the gene cluster
that codes for the synthesis of the O-chain of the Gram-negative
lipopolysaccharide (LPS; 13 ). Classically known as the 'O-antigen',
the diversity of this exposed envelope in Salmonella or Escherichia 
has been known for many years 1415 . The O-chain is a repeat-unit
polysaccharide; the monomeric repeat generally has between two
and six sugar residues. O-chains are extremely variable in the nature,
order and linkage of the different sugars 1316 . This complex
polysaccharide is very important for the survival of the cell, providing
the appropriate hydrophilic envelope to allow nutrient import
into the cell 17 . After subculture in the laboratory, many strains
lose part of the polysaccharide, giving rise to "rough" mutants, which
has probably obscured the critical importance that it has for the cell's
lifestyle. However, the importance of the O-chain for antibiotic
susceptibility has long been known, illustrating how the permeability
properties of the cell vary with small changes affecting this structure 18 .
Further, the gene cluster for such an important cell component is 
extremely variable. This diversity has classically been explained 
by the advantage of the concomitant antigenic variation that could 
prevent the host from identifying, and eventually expelling, all the 
strains of one of these species. However, even accepting this
simplistic inference from host-microbiome interaction, the variability
found in free-living bacteria is comparable to (if not higher than) that
of specialized pathogens or symbionts 17 .

As an example, Figure 1 shows the fGIs detected in the genomes 
available of Candidatus Pelagibacter ubique, probably the most 
abundant pelagic marine microbe. Further, in addition to the O-chain
gene cluster, all known exposed structural motifs that can vary
show similar genomic patterns of variation. For example, capsular
or slime layer polysaccharides 519 , the teichoic acids of
Gram-positives 20 , or the S-layer glycoproteins of archaea 4 21 all seem to
be located in fGIs. Other exposed structures that are also typical
components of the flexible gene pool are flagella, pili and porins. 
An alternative way to view the diversity found in all these cell com- 
ponents is that they all are important phage recognition targets. 

Phages, phages everywhere 

Viruses and their hosts are extremely entangled entities. In many 
environments there are on average about 10 phages for every one 
bacterium 22 , which means that bacteria are under constant attack. 
There are millions of viruses in every drop of ocean water; on
average, it is estimated that about a mole of viruses (6 × 10^23) attack
bacteria every minute in the oceans 5 6 . Some estimates suggest that
a quarter of newly photosynthesized carbon in marine environments
travels through the "viral shunt", moving it directly to dissolved
organic carbon before grazers or other consumers can access it 6,7 . The
diversity of bacteriophages is quite large, and dynamic, changing 
with time in a given environment 23 . Presumably, this change in virus 
diversity reflects changes in bacterial diversity, since phages are ob- 
ligate parasites, and many phages can only infect specific bacterial 
strains. It has been estimated that 60-70% of the bacterial genomes
sequenced to date contain prophages 8 9 (Rob Edwards, personal
communication). About two-thirds of all sequenced
Gamma-proteobacteria and low-GC Gram-positive bacteria
harbor prophages 10 ; thus the phages are also part of the pan-genome
for many of these organisms.

The surface of a cell is what is presented to the world (friend
and foe alike); in environments where bacteria are under constant
attack from viruses, the need to frequently change the shape and
appearance of surface proteins is compelling. Thus a good evolutionary
strategy would be to vary these proteins, both by changing their
structures and by distributing them amongst other bacteria
within the population, where possible.

Figure 1. BLASTatlas for Pelagibacter ubique, showing fGIs along the chromosome.

Can the need to diversify phage receptors explain the enormous
diversity of the pan-genome? Certainly not all of it, but it could be
responsible for a large part. However, the other major component of the
flexible pool might also be preserved via phage predation 12 .

The flexible pool and niche partitioning 

Many components of the flexible pool are involved in niche
partitioning: 1) transport of substrates and the cognate metabolic
pathways required for their assimilation by the cell; 2) regulation,
such as two-component systems involved in fine-tuning the
response to environmental stimuli; and 3) respiratory chains and/or
protective mechanisms involved in different oxygen or light
relationships. This has been found to apply to both bacteria 1124 27
and archaea 4 . For heterotrophic osmotrophs the transporter baggage
carried in the genome is determined largely by lifestyle
and niche specialization. Accordingly the transporters found in
different clonal lineages are extremely variable and are typical 
components of the flexible gene pool. In a remarkably parallel
way, hli genes (coding for high-light inducible proteins) present
in marine picocyanobacteria might influence the fine light
qualities exploited by these widespread phototrophs and are
also typical components of the flexible pool of these microbes 28 .
A similar story is depicted by the tonB receptors involved in the 
transport of micronutrients 29 . 

It is important to emphasize here that fGIs related to phage
sensitivity, such as the O-chain of the LPS, and fGIs related to niche
specialization are genetically linked in a single replicon, so that
negative selection by phage predation would automatically
compensate for positive selection by overly efficient exploitation of
resources. For example, a sudden increase in the concentration of
nutrients exploited by one clone, which might otherwise lead to a major
clonal expansion, would be kept in check by the increase in the
concentration of the linked phage receptors 12 . This mechanism of
population control, although negative for the short-term prevalence
of the clone, might be good for its long-term survival since it maintains
the complexity of the community and its endurance.
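
The compensation described above can be illustrated with a toy predator/prey simulation. This is a minimal sketch under assumptions that are not taken from the source: each clone has one strictly specific phage, all clones share a single logistic carrying capacity, and every parameter value is hypothetical. The equations are a Lotka-Volterra variant integrated with a simple Euler scheme.

```python
def simulate(growth_rates, carrying_capacity=1.0, adsorption=1.0,
             burst_yield=1.0, phage_decay=0.2, dt=0.01, steps=100_000):
    """Euler integration of clones that each have one specific phage.

    Assumed model (not from the source), for clone i with host density h_i
    and phage density p_i:
        dh_i/dt = r_i * h_i * (1 - sum(h) / K) - a * h_i * p_i
        dp_i/dt = b * a * h_i * p_i - m * p_i
    """
    n = len(growth_rates)
    hosts = [0.05] * n    # arbitrary small starting densities
    phages = [0.05] * n
    for _ in range(steps):
        crowding = 1.0 - sum(hosts) / carrying_capacity
        new_hosts, new_phages = [], []
        for r, h, p in zip(growth_rates, hosts, phages):
            dh = r * h * crowding - adsorption * h * p
            dp = burst_yield * adsorption * h * p - phage_decay * p
            new_hosts.append(h + dt * dh)
            new_phages.append(p + dt * dp)
        hosts, phages = new_hosts, new_phages
    return hosts, phages


# Clone 1 exploits its resources four times faster than clone 0.
hosts, phages = simulate([0.5, 2.0])
```

In this model, setting dp_i/dt = 0 gives the same host density m / (b * a) for every clone, whatever its growth rate. The faster-growing clone therefore ends up supporting more phages, not more cells, which is the equalizing effect argued for in the text.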

The pan-selectome 

The unit of selection has been a major conundrum in evolutionary 
biology 30 . Historically, the proposals have ranged, according to times
and fashions, from the gene 31 to the community 32 or even the
planet 33 . In 1997 Ernst Mayr defined the unit of selection as "a discrete
entity and a cohesive whole, an individual or a social group,
the survival and successful reproduction of which is favored by
selection owing to its possession of certain properties" 34 .
We would like to advance here the idea that in nature, the selection
unit in a prokaryotic community or assemblage is an
ensemble of clonal lineages that share the same (or highly
similar) core genome but differ in their flexible gene complements
(i.e. the selection unit at the genomic level
would be the pan-genome; see Box 1). These meta-clonal populations
are maintained and equalized by phage predation 12 , much as
the immune system in a mammal keeps
tumors in check. Thus, phage populations should be considered
as belonging to the same selection unit, not only because
they are part of the pan-genome (which they often are) but
because they are instrumental for its long-term preservation
(Figure 2). Furthermore, the different clonal lineages as retrieved
by pure culture have little chance of succeeding in nature (or in
complex biotechnological processes such as dairy or wine production
or sewage treatment).

Although this idea invokes elements of group selection, a controver- 
sial theory of evolutionary biology 35 , it is being proposed to explain 
evolution of prokaryotic populations, about which very little evolu- 
tionary theory has been solidly established, particularly outside of 
the walls of the test tube. 
Figure 2. A prokaryotic population with cells and phages, as depicted by the Constant-Diversity model. The genome is indicated as a
solid blue circle (core) with two flexible genomic islands (fGIs; see text) in different colors to indicate different genes coding for the same
functions. The two fGIs indicated code for the O-chain of the lipopolysaccharide (in a Gram-negative bacterium) and for a set of transporters.
In reality there are many more fGIs (often four or five or more); further, many differential transporters or other niche exploitation related features
can be coded in small flexible islands or islets interspersed along the core, but they are all genetically linked to the O-chain and other
surface related fGIs that are major recognition targets for phages. Three types of phages and receptors have been indicated by different
geometric forms. This set of cellular clones and phages is in equilibrium since the disproportionate increase of cells or phages is prevented
by the density-dependent kinetics of phage infection. For example, if one clone increases over a certain threshold it will be over-preyed by its
phages until the population returns to normal, following a classical Lotka-Volterra predator/prey equilibrium. This is favorable for the meta-clonal
population, since it keeps different lineages with complementary ecological skills acting in tune. This meta-clonal/viral population can be
selected as a unit to exploit similar environments, such as the water column in the ocean, which is very similar worldwide.
Data supporting group selection in prokaryotes

One of us proposed a Constant-Diversity model based on the distri- 
bution of gene functions in the genomic islands that under-recruit in 
metagenomes, where the core genome recruits at high similarity 12 . 
Since then, several independent workers have found data supporting 
this notion. 
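
The notion of islands that "under-recruit" can be sketched as a scan along the genome for long runs of windows with little metagenomic coverage. This is an illustrative sketch only: the window size, the 20% threshold and the toy coverage values are assumptions, the 10 kb minimum simply echoes the island size mentioned earlier, and the circularity of the chromosome is ignored.

```python
from statistics import median


def under_recruiting_islands(coverage, window_bp=1000, fraction=0.2,
                             min_bp=10_000):
    """Return (start_bp, end_bp) runs of windows with low recruitment.

    coverage: recruited-read depth per consecutive genome window.
    A window counts as low when its depth is below `fraction` times the
    median depth; runs shorter than `min_bp` are discarded.
    """
    coverage = list(coverage)
    threshold = fraction * median(coverage)
    islands, run_start = [], None
    # A trailing sentinel with infinite depth closes a run at the genome end.
    for i, depth in enumerate(coverage + [float("inf")]):
        if depth < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if (i - run_start) * window_bp >= min_bp:
                islands.append((run_start * window_bp, i * window_bp))
            run_start = None
    return islands


# Hypothetical depths: a 12 kb low-coverage stretch and a 3 kb dip.
toy_coverage = [50] * 20 + [2] * 12 + [48] * 30 + [1] * 3 + [52] * 10
islands = under_recruiting_islands(toy_coverage)
```

Only the 12 kb stretch is reported for the toy data; the 3 kb dip falls below the minimum island length.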

One of the most convincing demonstrations that different fGIs in
prokaryotes are largely involved in phage sensitivity is the work
of Avrani et al. 36 , in which the resistance to phages in several isolates
of Prochlorococcus could all be assigned to mutations taking
place in Genomic Island 4 of this microbe, previously identified 9
as involved in the synthesis of the O-chain of the lipopolysaccharide.
By measuring the frequencies of mutants in metagenomic
datasets the authors conclude: "abundant Prochlorococcus popu- 
lations belonging to a single ecotype with common physiological 
and ecological characteristics are actually an assortment of sub- 
populations with different susceptibilities to co-occurring phages" 
and "Thus, large numbers of taxonomically identical organisms, 
fulfilling the same ecological role, are probably maintained in the 
environment as a result of micro-diversity in phage susceptibility 
regions" 36 . A similar situation is found in Synechococcus, where
resistant mutants were also found to have altered genes in the
O-chain region 37 .

Many other recent developments support the coexistence of complex 
populations of phages and their hosts in natural communities 2338 41 . 
There is also evidence that phages contribute to keeping high levels of
host diversity 42 and that diversity promotes productivity 43 .

Recently, other alternatives have been used to explain the diversity
of pan-genomes by some type of kin selection, such as the
so-called Black Queen hypothesis 44 , in which some genes present
in certain lineages can supply functions for other clonal lineages.
Along the same lines, Teusink et al. describe a "game theory"
explanation for the coexistence of proteolytic and non-proteolytic
strains in dairy multi-starter cultures 45 . However, these models only
make the role of phages in keeping the proper ratios
among the different cooperating lineages even more critical.

Conclusion 

We are proposing here a way of thinking about prokaryotic communities
in which cells with a core above 95% ANI, but with a wide diversity
of flexible genome complement, and the phages preying on them, form
an evolutionary selection unit, the "prokaryotic selecton". This has
important repercussions for the evolutionary biology of prokaryotes.
Box 1. The pan-selectome and the evolutionary unit (a.k.a.
species).

The species as an evolutionary unit is at the centre of the modern 
Neo-Darwinian synthesis. Species are not just considered as 
a taxonomic level of classification but as a kind of biological 
entity or level of organization beyond the cell or the individual 46 . 
However, this "natural species concept" has been difficult to
transfer to prokaryotes due to the lack of sexual reproduction
and the unpredictable levels of recombination (particularly
when illegitimate recombination is included) of prokaryotic genomes 47 .
If the pan-selectomes described here are the units of selection, 
they might also fulfill the requirements to be considered natural 
evolutionary units. However, this requires a mechanism that 
provides the discontinuities in genetic diversity found in nature. 
Metagenomic data show that there are discontinuities located at
ca. 95% ANI, beyond which a gap indicates an empty region of
sequence diversity space 48 . Of course this only applies to the
core genome of the meta-clones, but it still reflects a coherence that
requires an evolutionary drive reminiscent of, for example, the breeding
barriers found in animal species. How could the meta-clones
explain such discontinuities? This critical question remains to be 
answered. However, we would like to advance one hypothesis that 
we call "the maverick hypothesis". Let's assume that a meta-clone 
of bacteria and phages is established somewhere, for example, 
exploiting the degradation of chitin, a common component in the 
water column of the ocean. A pan-genome evolves that allows
for the efficient exploitation of the multiple resources associated
with this polysaccharide (other accompanying organic molecules,
phosphorus and nitrogen sources, etc.). The populations of this
microbe also have a complement of phages to keep a
well-equilibrated consortium. Physical proximity near where the
resources (such as zooplankton remains) are found facilitates
genomic homogenization by homologous recombination. The
pools of genes in the flexible genome can diverge enormously but 
the core will remain relatively coherent. The rise of a "maverick" 
that would try to form a monoclonal population diverging away 
from the homogenizing influence of the rest would be prevented 
by the excessive phage predation pressure coupled to less 
efficient exploitation of resources. This trend in the long run might 
be enough to provide the discontinuities required to form a natural 
species-like entity. 